Biostat 200B Homework 4

Due Feb 9 @ 11:59PM

Author

Ziheng Zhang_606300061

Question A.1

How do we interpret the coefs of the interaction terms? Compare these parameter estimates to those from the separate models.

Answer:

The SAS code is as follows:

The parameter estimates for the model containing interaction terms are as follows:

The fitted model is \[\begin{align*} \hat{risk}&=\hat{\beta_0}+\hat{\beta}_{\text{regnc}}regnc+\hat{\beta}_{\text{regs}}regs+\hat{\beta}_{\text{regw}}regw+\hat{\beta}_{\text{length}}length+\\ &\quad \hat{\beta}_{\text{nclength}}nclength+\hat{\beta}_{\text{slength}}slength+\hat{\beta}_{\text{wlength}}wlength\\ &=\hat{\beta_0}+\hat{\beta}_{\text{regnc}}regnc+\hat{\beta}_{\text{regs}}regs+\hat{\beta}_{\text{regw}}regw+\\ &\quad (\hat{\beta}_{\text{length}}+\hat{\beta}_{\text{nclength}}regnc+\hat{\beta}_{\text{slength}}regs+\hat{\beta}_{\text{wlength}}regw)length \end{align*}\]

Coefficient for nclength = 0.30337: for each one-day increase in the average length of stay of all patients in the hospital, the increase in estimated mean risk in the North Central region is 0.30337 percentage points larger than in the reference region (North East).

Coefficient for slength = 0.43930: for each one-day increase in the average length of stay of all patients in the hospital, the increase in estimated mean risk in the South region is 0.43930 percentage points larger than in the reference region (North East).

Coefficient for wlength = -0.29237: for each one-day increase in the average length of stay of all patients in the hospital, the increase in estimated mean risk in the West region is 0.29237 percentage points smaller than in the reference region (North East).

The parameter estimates for the separate models are as follows:

In the separate model for region 2 (North Central), the coefficient for length is 0.60893, which equals the coefficient for the interaction term nclength in the original model, 0.30337, plus the coefficient for length in the original model, 0.30556.
The intercept in the separate model for region 2 is -1.50280, which equals the intercept in the original model, 1.47235, plus the coefficient for regnc, -2.97515.
\(\hat{\beta}_{\text{length}}' = \hat{\beta}_{\text{nclength}} + \hat{\beta}_{\text{length}} = 0.30337 + 0.30556 = 0.60893\) and \(\hat{\beta_0}' = \hat{\beta_0} + \hat{\beta}_{\text{regnc}} = 1.47235 - 2.97515 = -1.50280\)

In the separate model for region 3 (South), the coefficient for length is 0.74486, which equals the coefficient for the interaction term slength in the original model, 0.43930, plus the coefficient for length in the original model, 0.30556.
The intercept in the separate model for region 3 is -2.91928, which equals (up to rounding in the last digit) the intercept in the original model, 1.47235, plus the coefficient for regs, -4.39164.
\(\hat{\beta}_{\text{length}}' = \hat{\beta}_{\text{slength}} + \hat{\beta}_{\text{length}} = 0.43930 + 0.30556 = 0.74486\) and \(\hat{\beta_0}' = \hat{\beta_0} + \hat{\beta}_{\text{regs}} = 1.47235 - 4.39164 = -2.91929 \approx -2.91928\)

In the separate model for region 4 (West), the coefficient for length is 0.01319, which equals the coefficient for the interaction term wlength in the original model, -0.29237, plus the coefficient for length in the original model, 0.30556.
The intercept in the separate model for region 4 is 4.27421, which equals the intercept in the original model, 1.47235, plus the coefficient for regw, 2.80186.
\(\hat{\beta}_{\text{length}}' = \hat{\beta}_{\text{wlength}} + \hat{\beta}_{\text{length}} = -0.29237 + 0.30556 = 0.01319\) and \(\hat{\beta_0}' = \hat{\beta_0} + \hat{\beta}_{\text{regw}} = 1.47235 + 2.80186 = 4.27421\)
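
The comparison between the interaction model and the separate models can be checked numerically. The following is a minimal Python sketch, not part of the original SAS analysis; the names `shifts` and `region_line` are illustrative, and the numbers are the estimates quoted in this answer.

```python
# Estimates copied from the interaction-model output quoted in this answer.
b0, b_length = 1.47235, 0.30556

# region: (indicator coefficient, interaction coefficient)
shifts = {
    "North Central": (-2.97515, 0.30337),
    "South": (-4.39164, 0.43930),
    "West": (2.80186, -0.29237),
}


def region_line(region):
    """Return (intercept, slope) implied for one region by the interaction model."""
    d_intercept, d_slope = shifts[region]
    return b0 + d_intercept, b_length + d_slope


for region in shifts:
    intercept, slope = region_line(region)
    print(f"{region}: intercept = {intercept:.5f}, slope = {slope:.5f}")
```

Each printed pair should agree with the corresponding separate-model fit, up to rounding in the last reported digit.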

Question A.2

How would we test whether the slope coef for hospitals in the North Central region is equal to the slope coef for hospitals in the South region? Run this test.

Answer:

The SAS code and test results are as follows:

For the North Central region, the slope for length is \(\beta_{\text{length}}+\beta_{\text{nclength}}\), and for the South region, the slope for length is \(\beta_{\text{length}}+\beta_{\text{slength}}\). The two slopes are equal exactly when \(\beta_{\text{nclength}} = \beta_{\text{slength}}\), so we test the null hypothesis \(H_0: \beta_{\text{nclength}} = \beta_{\text{slength}}\) against the alternative hypothesis \(H_1: \beta_{\text{nclength}} \neq \beta_{\text{slength}}\).
The F-test, with \(F(1,105)\), gives a p-value of \(0.5374>0.05\), so we do not reject the null hypothesis: there is no significant evidence that the slope for hospitals in the North Central region differs from the slope for hospitals in the South region.
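
As a sketch of what the software computes for a single linear contrast of this kind, the test statistic can be written as follows. The variance and covariance symbols are notation introduced here for illustration; their values come from the estimated covariance matrix of the coefficients and are not shown in this report.

\[F = \frac{(\hat{\beta}_{\text{nclength}}-\hat{\beta}_{\text{slength}})^2}{\widehat{\mathrm{Var}}(\hat{\beta}_{\text{nclength}})+\widehat{\mathrm{Var}}(\hat{\beta}_{\text{slength}})-2\,\widehat{\mathrm{Cov}}(\hat{\beta}_{\text{nclength}},\hat{\beta}_{\text{slength}})}\]

Under \(H_0\) and the normal error model, this statistic follows an \(F(1,105)\) distribution, which is where the reported p-value comes from.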

Question A.3

How would we test whether the slope coef for hospitals in the West region is equal to the slope coef for hospitals in the North East region? Run this test

Answer:

The SAS code and test results are as follows:

For the North East region (the reference region, regnc=regs=regw=0), the slope for length is \(\beta_{\text{length}}\), and for the West region, the slope for length is \(\beta_{\text{length}}+\beta_{\text{wlength}}\). The two slopes are equal exactly when \(\beta_{\text{wlength}} = 0\), so we test the null hypothesis \(H_0: \beta_{\text{wlength}} = 0\) against the alternative hypothesis \(H_1: \beta_{\text{wlength}} \neq 0\).
The F-test, with \(F(1,105)\), gives a p-value of \(0.3147>0.05\), so we do not reject the null hypothesis: there is no significant evidence that the slope for hospitals in the West region differs from the slope for hospitals in the North East region.
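
Because this hypothesis involves a single coefficient, the F statistic is the square of the usual t statistic for that coefficient. Here \(\mathrm{SE}(\hat{\beta}_{\text{wlength}})\) denotes the standard error from the parameter estimates table, which is not reproduced in this text:

\[F = t^2 = \left(\frac{\hat{\beta}_{\text{wlength}}}{\mathrm{SE}(\hat{\beta}_{\text{wlength}})}\right)^2\]

The p-value of 0.3147 should therefore coincide with the two-sided t-test p-value reported for wlength in the interaction model output.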

Question A.4

How do these regression coefficients compare to the previous ones, with length not centered? Interpret each regression coef, including the intercept.

Answer:

The SAS codes and parameter estimates for the centered model are as follows:

Coefficients for lengthc, nclengthc, slengthc, and wlengthc are the same as the previous ones. Intercept and coefficients for regnc, regs, and regw change. The fitted model for the centered model is: \[\begin{align*} \hat{risk}&=\hat{\beta_0}'+\hat{\beta}_{\text{regnc}}'regnc+\hat{\beta}_{\text{regs}}'regs+\hat{\beta}_{\text{regw}}'regw+\\ &\quad (\hat{\beta}_{\text{length}}'+\hat{\beta}_{\text{nclength}}'regnc+\hat{\beta}_{\text{slength}}'regs+\hat{\beta}_{\text{wlength}}'regw)(length-\bar{length})\\ &=\hat{\beta_0}'+\hat{\beta}_{\text{regnc}}'regnc+\hat{\beta}_{\text{regs}}'regs+\hat{\beta}_{\text{regw}}'regw-\\ &\quad (\hat{\beta}_{\text{length}}'+\hat{\beta}_{\text{nclength}}'regnc+\hat{\beta}_{\text{slength}}'regs+\hat{\beta}_{\text{wlength}}'regw)\bar{length}+\\ &\quad (\hat{\beta}_{\text{length}}'+\hat{\beta}_{\text{nclength}}'regnc+\hat{\beta}_{\text{slength}}'regs+\hat{\beta}_{\text{wlength}}'regw)length\\ &=\hat{\beta_0}+\hat{\beta}_{\text{regnc}}regnc+\hat{\beta}_{\text{regs}}regs+\hat{\beta}_{\text{regw}}regw+\\ &\quad (\hat{\beta}_{\text{length}}+\hat{\beta}_{\text{nclength}}regnc+\hat{\beta}_{\text{slength}}regs+\hat{\beta}_{\text{wlength}}regw)length \end{align*}\]

Therefore, \[\begin{align*} &\hat{\beta_0}'+\hat{\beta}_{\text{regnc}}'regnc+\hat{\beta}_{\text{regs}}'regs+\hat{\beta}_{\text{regw}}'regw-(\hat{\beta}_{\text{length}}'+\hat{\beta}_{\text{nclength}}'regnc+\\ &\hat{\beta}_{\text{slength}}'regs+\hat{\beta}_{\text{wlength}}'regw)\bar{length}=\hat{\beta_0}+\hat{\beta}_{\text{regnc}}regnc+\hat{\beta}_{\text{regs}}regs+\hat{\beta}_{\text{regw}}regw \end{align*}\]

Also, \[\begin{align*} &(\hat{\beta}_{\text{length}}'+\hat{\beta}_{\text{nclength}}'regnc+\hat{\beta}_{\text{slength}}'regs+\hat{\beta}_{\text{wlength}}'regw)length=\\ &(\hat{\beta}_{\text{length}}+\hat{\beta}_{\text{nclength}}regnc+\hat{\beta}_{\text{slength}}regs+\hat{\beta}_{\text{wlength}}regw)length \end{align*}\]

So we have: \[\begin{align*} &\hat{\beta}_{\text{length}}'=\hat{\beta}_{\text{length}}\\ &\hat{\beta}_{\text{nclength}}'=\hat{\beta}_{\text{nclength}}\\ &\hat{\beta}_{\text{slength}}'=\hat{\beta}_{\text{slength}}\\ &\hat{\beta}_{\text{wlength}}'=\hat{\beta}_{\text{wlength}}\\ &\hat{\beta_0}'=\hat{\beta_0}+\hat{\beta}_{\text{length}}\bar{length}\\ &\hat{\beta}_{\text{regnc}}'=\hat{\beta}_{\text{regnc}}+\hat{\beta}_{\text{nclength}}\bar{length}\\ &\hat{\beta}_{\text{regs}}'=\hat{\beta}_{\text{regs}}+\hat{\beta}_{\text{slength}}\bar{length}\\ &\hat{\beta}_{\text{regw}}'=\hat{\beta}_{\text{regw}}+\hat{\beta}_{\text{wlength}}\bar{length} \end{align*}\]

Intercept = 4.42042 = 1.47235+0.30556 \(\times\) 9.648 (up to rounding of the mean). Interpretation: when length equals its mean value (lengthc = 0) and regnc, regs, and regw are all zero, so that nclengthc, slengthc, and wlengthc are also zero, i.e. for a hospital in the reference region (North East) with an average length of stay, the estimated mean risk is 4.42042 percent. An intercept is only meaningful when x=0 is a reasonable value; after centering, lengthc = 0 corresponds to the mean length of stay, so the interpretation is meaningful here.

Coefficient for lengthc = 0.30556: for each one-day increase in length, the average length of stay of all patients in the hospital, the estimated mean risk in the reference region (North East) increases by 0.30556 percentage points.

Coefficient for nclengthc = 0.30337: for each one-day increase in length, the increase in estimated mean risk in the North Central region is 0.30337 percentage points larger than in the reference region (North East).

Coefficient for slengthc = 0.43930: for each one-day increase in length, the increase in estimated mean risk in the South region is 0.43930 percentage points larger than in the reference region (North East).

Coefficient for wlengthc = -0.29237: for each one-day increase in length, the increase in estimated mean risk in the West region is 0.29237 percentage points smaller than in the reference region (North East).

Coefficient for regnc = -0.04825 = -2.97515+0.30337 \(\times\) 9.648: when length equals its mean value, the estimated mean risk in the North Central region is 0.04825 percentage points lower than in the reference region (North East).

Coefficient for regs = -0.15325 = -4.39164+0.43930 \(\times\) 9.648: when length equals its mean value, the estimated mean risk in the South region is 0.15325 percentage points lower than in the reference region (North East).

Coefficient for regw = -0.01893 = 2.80186-0.29237 \(\times\) 9.648: when length equals its mean value, the estimated mean risk in the West region is 0.01893 percentage points lower than in the reference region (North East).
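
The relations between the centered and uncentered coefficients derived in this answer can be verified with a few lines of arithmetic. This is a hedged Python sketch rather than the SAS code used for the homework; `uncentered` and `centered_coef` are illustrative names, and the mean of 9.648 is the rounded value quoted in the text, so the results match the SAS output only to about four decimal places.

```python
# Sample mean of length as quoted in the text (rounded).
mean_length = 9.648

# term: (coefficient in the uncentered model, matching slope-type coefficient)
uncentered = {
    "intercept": (1.47235, 0.30556),   # (beta_0, beta_length)
    "regnc": (-2.97515, 0.30337),      # (beta_regnc, beta_nclength)
    "regs": (-4.39164, 0.43930),       # (beta_regs, beta_slength)
    "regw": (2.80186, -0.29237),       # (beta_regw, beta_wlength)
}


def centered_coef(term):
    """Coefficient of `term` after replacing length by length - mean_length."""
    level, slope = uncentered[term]
    return level + slope * mean_length


for term in uncentered:
    print(f"{term}: {centered_coef(term):.5f}")
```

The slope-type coefficients (lengthc, nclengthc, slengthc, wlengthc) do not appear in the loop because centering leaves them unchanged.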

Question B.1

Log transformations: Fit the following models and provide the parameter estimate output (coefs, SEs, p-values), and interpret the regression coefficient associated with the predictor variable. Also, comment briefly on the model fit (residual) diagnostics as to whether the model appears to be a good fit to the data and whether the assumptions of the model are met.

  1. Regress log10 SpikeIgG (Y) on days PSO (X).
  2. Regress ln SpikeIgG (Y) on days PSO (X).
  3. Regress days PSO on ln(age).

Answer:

  1. The fitted model is as follows:

The parameter estimate output and the residual diagnostics are as follows:

Interpretation for coefficient of days PSO: a one-day increase in days PSO multiplies the estimated mean Spike IgG by \(10^{-0.00192}=0.996\), i.e., a 0.4% decrease.
From the plots above, it seems that \(E(\epsilon_{i}) = 0\) for all \(\textit{i}\), and the residual plot does not show any particular pattern. The error variance is roughly constant across all observations, and the points in the QQ plot roughly follow a straight line. So the assumptions of the normal error regression model are approximately met, but the model does not fit the data well because R-Square (\(R^2\)) is only 0.04.

  2. The fitted model is as follows:

The parameter estimate output and the residual diagnostics are as follows:

Interpretation for coefficient of days PSO: a one-day increase in days PSO multiplies the estimated mean Spike IgG by \(e^{-0.00442}=0.996\), i.e., a 0.4% decrease. This matches the log10 model, as expected, because the two models differ only in the base of the logarithm.
From the plots above, it seems that \(E(\epsilon_{i}) = 0\) for all \(\textit{i}\), and the residual plot does not show any particular pattern. The error variance is roughly constant across all observations, and the points in the QQ plot roughly follow a straight line. So the assumptions of the normal error regression model are approximately met, but the model does not fit the data well because R-Square (\(R^2\)) is only 0.04.

  3. The fitted model is as follows:

The parameter estimate output and the residual diagnostics are as follows:

Interpretation for coefficient of ln(age): for every 1% increase in age, the estimated mean days PSO increases by approximately \(0.01 \times 16.13521=0.16\) days.
From the plots above, it seems that \(E(\epsilon_{i}) = 0\) for all \(\textit{i}\), and the residual plot does not show any particular pattern. The error variance is roughly constant across all observations. However, the points in the QQ plot do not follow a straight line and show a systematic pattern. So the assumptions of the normal error regression model are not met, and the model does not fit the data well because R-Square (\(R^2\)) is only 0.0072.
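
The three back-transformations used in these interpretations can be reproduced as follows. This is a minimal Python sketch with illustrative variable names; the slopes are the estimates quoted in the answers above, and the comparison of the exact and approximate level-log effect is an added check, not part of the SAS output.

```python
import math

b_log10 = -0.00192    # slope from log10(SpikeIgG) on days PSO
b_ln = -0.00442       # slope from ln(SpikeIgG) on days PSO
b_lnage = 16.13521    # slope from days PSO on ln(age)

# Log-outcome models: one extra day multiplies the outcome by base**slope.
mult_log10 = 10 ** b_log10
mult_ln = math.exp(b_ln)
pct_log10 = 100 * (mult_log10 - 1)
pct_ln = 100 * (mult_ln - 1)

# The two slopes differ only by the factor ln(10).
implied_ln_slope = b_log10 * math.log(10)

# Log-predictor model: effect of a 1% increase in age on days PSO.
exact_change = b_lnage * math.log(1.01)
approx_change = 0.01 * b_lnage

print(f"log10 model: x{mult_log10:.4f} per day ({pct_log10:.2f}%)")
print(f"ln model:    x{mult_ln:.4f} per day ({pct_ln:.2f}%)")
print(f"ln slope implied by log10 slope: {implied_ln_slope:.5f}")
print(f"1% older: exact {exact_change:.3f} days, approx {approx_change:.3f} days")
```

The approximation \(0.01\hat{\beta}\) relies on \(\ln(1.01)\approx 0.01\), which is why the exact and approximate changes agree to two decimal places.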

Question B.2

Log-log model: Using the covid_immune data, fit and interpret a log(Y)-log(X) model for the relationship between SpikeIgG and SpikeIgA. This should include doing the following:

  1. Produce and examine the distributions of SpikeIgG and SpikeIgA, on their original scale and after log transformation.
  2. Produce a scatterplot of SpikeIgG (y) versus SpikeIgA (x) with a loess smooth and the linear model fit.
  3. Interpret the coefficient associated with log(SpikeIgA) from the log-log model.
  4. Use residuals to check whether the log-log model is a reasonable fit to the data.

Answer:

  1. The distribution of SpikeIgG and SpikeIgA on their original scale are as follows:

    The distributions of SpikeIgG and SpikeIgA are both right-skewed.
    After log transformation, the distributions of log(SpikeIgG) and log(SpikeIgA) are as follows:

    The distributions of log(SpikeIgG) and log(SpikeIgA) are approximately normal.

  2. The scatterplot of SpikeIgG (y) versus SpikeIgA (x) with a loess smooth and the linear model fit is as follows:

    Most of the \(x\) and \(y\) values are concentrated near the low end of both scales, and the linear model fits poorly on the original scale.

  3. The fitted model is as follows:

    Interpretation for coefficient of log(SpikeIgA): for every 1% increase in SpikeIgA, the estimated mean SpikeIgG increases by approximately 0.47350%.

  4. The residual plots are as follows:

    From the plots above, it seems that \(E(\epsilon_{i}) = 0\) for all \(\textit{i}\), and the residual plot does not show any particular pattern. The error variance is roughly constant across all observations, and the points in the QQ plot roughly follow a straight line. R-Square (\(R^2\)) is only 0.2778, but the assumptions of the normal error regression model are approximately met, so the log-log model is a reasonable fit to the data.
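
The "1% in x gives about 0.47% in y" reading of the log-log slope is an approximation. The sketch below, in Python with the illustrative function name `pct_change_y`, computes the exact multiplicative effect implied by the fitted slope of 0.47350 for any percent change in SpikeIgA.

```python
# Slope of log(SpikeIgG) on log(SpikeIgA) quoted in the answer above.
elasticity = 0.47350


def pct_change_y(pct_change_x, slope=elasticity):
    """Exact percent change in the outcome for a given percent change in x."""
    ratio_x = 1 + pct_change_x / 100
    return 100 * (ratio_x ** slope - 1)


print(f"1% increase in SpikeIgA: {pct_change_y(1):.4f}% increase in SpikeIgG")
print(f"Doubling SpikeIgA:       {pct_change_y(100):.2f}% increase in SpikeIgG")
```

For small changes the exact value is close to the slope itself; for large changes, such as a doubling, the exact formula should be used instead of multiplying the slope by the percent change.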

Question B.3

Interaction between a categorical and continuous variable. For this problem, we will use interactions to explore whether the rate of exponential decay of SpikeIgG depends on the individual’s peak disease severity.

  1. In the covid_immune dataset, I created a variable peakDiseaseSeverity that is coded as 1 = asymptomatic or mild, 2 = moderate, 3 = severe. Determine the sample sizes in each category that have data for both daysPSO and SpikeIgG.
  2. Fit a model with an interaction between daysPSO and peak disease severity. Use asymptomatic/mild as the reference category. The outcome variable should be log-transformed SpikeIgG.
  3. Conduct a joint test of whether the two interaction terms are equal to zero (this will be an F test).
  4. Regardless of the conclusion of the test for interaction, obtain point estimates of the half-lives of SpikeIgG for each of the 3 disease severity groups.

Answer:

  1. The sample sizes in each category that have data for both daysPSO and SpikeIgG are as follows:

    For category 1 = asymptomatic or mild, the sample size is 206. For category 2 = moderate, the sample size is 9. For category 3 = severe, the sample size is 12.

  2. The fitted model is as follows:

  3. The joint test of whether the two interaction terms are equal to zero is as follows:

    Null hypothesis: \(H_0: \beta_\text{modays} = \beta_\text{sedays} = 0\); Alternative hypothesis: \(H_1:\) At least one of \(\beta_\text{modays}\) and \(\beta_\text{sedays}\) is not equal to 0.
    The F-test, with \(F(2,222)\), gives a p-value of \(0.5511>0.05\) for the joint test, so we do not reject the null hypothesis: there is no significant evidence that at least one of the two interaction terms differs from zero.

  4. The fitted model is \[\begin{align*} log(\hat{SpikeIgG}) &= 6.78870 +1.53034 \times dismo +2.52143 \times disse - 0.00489 \times daysPSO + \\ &\quad 0.00096 \times dismo*daysPSO - 0.00734 \times disse*daysPSO \\ &= 6.78870 +1.53034 \times dismo +2.52143 \times disse +\\ &\quad(- 0.00489+ 0.00096 \times dismo- 0.00734 \times disse) * daysPSO \end{align*}\]

So when peakDiseaseSeverity is 1 = asymptomatic or mild, i.e. dismo=disse=0, the coefficient of daysPSO is -0.00489.
When peakDiseaseSeverity is 2 = moderate, i.e. dismo=1, disse=0, the coefficient of daysPSO is -0.00489+0.00096=-0.00393.
When peakDiseaseSeverity is 3 = severe, i.e. dismo=0, disse=1, the coefficient of daysPSO is -0.00489-0.00734=-0.01223.
Under exponential decay, SpikeIgG halves when \(\beta t = -\mathrm{ln}2\), so the point estimate of the half-life is \(t_{1/2}=\frac{\mathrm{ln}2}{-\beta}\), where \(\beta\) is the coefficient of daysPSO for the group. The point estimates of the half-lives of SpikeIgG for the 3 disease severity groups are as follows:

  • For category 1 = asymptomatic or mild, the point estimate of the half-life is \(t_{1/2}=\frac{\mathrm{ln}2}{0.00489}=141.7\) days.

  • For category 2 = moderate, the point estimate of the half-life is \(t_{1/2}=\frac{\mathrm{ln}2}{0.00393}=176.4\) days.

  • For category 3 = severe, the point estimate of the half-life is \(t_{1/2}=\frac{\mathrm{ln}2}{0.01223}=56.7\) days.
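
The half-life calculation can be reproduced with the short Python sketch below. The names `interaction` and `half_life` are illustrative, the slopes are the estimates from the fitted model above, and the formula assumes the outcome was transformed with the natural log, as the use of \(\mathrm{ln}2\) in this answer implies.

```python
import math

# daysPSO slope for the reference group (asymptomatic or mild).
base_slope = -0.00489

# Interaction coefficients added to the reference slope for each group.
interaction = {
    "asymptomatic or mild": 0.0,
    "moderate": 0.00096,
    "severe": -0.00734,
}


def half_life(slope):
    """Days for the outcome to halve under exponential decay exp(slope * t)."""
    if slope >= 0:
        raise ValueError("half-life is only defined for a negative slope")
    return math.log(2) / (-slope)


half_lives = {group: half_life(base_slope + delta)
              for group, delta in interaction.items()}

for group, days in half_lives.items():
    print(f"{group}: {days:.1f} days")
```

These are point estimates only; the moderate and severe groups contain 9 and 12 observations, so their half-lives are likely to be estimated with considerable uncertainty.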